Feature #3: Find Dictionary
Implementing the "Find Dictionary" feature for our "Cyber Security" project.
Description#
We use a network protocol that encrypts all application messages using a proprietary scheme. The encryption scheme has a unique property that the sequence of encrypted messages in a session appears to be in sorted order according to a secret dictionary. However, the dictionary is not transmitted for security purposes.
Before the sender starts transmitting actual messages, it sends several encrypted training messages to the receiver. The sender guarantees that the training messages will follow the lexicographic order according to the unknown dictionary.
The receiver must reverse engineer the training messages and generate the dictionary for future communication with the sender. If the order of the messages is invalid, the receiver generates an empty dictionary and asks the sender to retransmit the training messages.
For simplicity’s sake, we can assume that the encrypted contents of the messages only consist of English lowercase letters.
Let’s review a few examples below:
Solution#
We can essentially map this problem to a graph problem but, before exploring the exact details of the solution, there are a few things that we need to keep in mind:
- The letters within a message don’t tell us anything about the relative order. For example, the message `educative` in the list does not tell us that the letter `e` is before the letter `d`.
- The input can contain a message followed by its prefix, for example, `educated` and then `educate`. These cases will never result in a valid alphabet (because in a valid alphabet, prefixes always come first). We’ll need to make sure our solution detects these cases correctly.
- There can be more than one valid alphabet ordering. It is fine for our algorithm to return any one of them.
- The output dictionary must contain all unique letters within the messages list, including those that could be in any position within the ordering. It should not contain any additional letters that were not in the input.
Now, back to the graph problem: we can break this particular problem into three parts:

- Extract the necessary information to identify the dependency rules from the messages. For example, in the messages provided in the slides above, `["decode", "interview"]`, the letter `d` comes before `i`.
- With the gathered information, we can put these dependency rules into a directed graph, with the letters as nodes and the dependencies (order) as the edges.
- Lastly, we can sort the graph nodes topologically to generate the letter ordering (dictionary).
Let’s look at each part in more depth.
Part-1: Identifying the dependencies#
Let’s start with an example of encrypted training messages and observe the initial ordering through simple reasoning:
As in the English language dictionary, where all the words starting with a come at the start followed by the words starting with b, c, d, and so on, we can expect the first letters of each message to be in alphabetical order as well.
Removing the duplicates, we get:
The following illustration might clarify this step:
Looking at the letters above, we know the relative order of these letters but do not know how these letters fit in with the rest of the letters. To get more information, we’ll need to look further into our English dictionary analogy. The word dirt comes before dorm. This is because we look at the second letter when the first letter is the same. In this case, i comes before o in the alphabet.
We can apply the same logic to our encrypted messages and look at the first two messages, mzsor and mqov. As the first letter is the same in both messages, we look at the second letter. The first message has z, and the second one has q. Therefore, we can safely say that z comes before q. We now have two fragments of the letter-order:
We don’t know yet how these two fragments could fit together into a single ordering. For example, we don’t know whether m is before q, or q is before m, or even whether or not there’s enough information available in the input for us to know.
We’ve now gotten all the information we can out of the first two messages. All letters after z in mzsor, and after q in mqov, can be ignored because they do not impact the relative ordering of the two messages. To better understand this, we can think back to dirt and dorm. Because i and o already differ, the rt and rm parts are unimportant for determining alphabetical ordering.
Hopefully, we can see a pattern here. When two messages are adjacent, we need to look for the first difference between them. That difference tells us the relative order between two letters. Let’s have a look at all the relations we can extract by comparing adjacent messages:
Notice that we did not mention rules such as `m -> a`. This is fine because we can derive this relation from `m -> x` and `x -> a`.
This is it for the first part. Let’s put the pieces that we have in place.
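The comparison rule above can be sketched as a small helper; `extract_relations` is an illustrative name, not necessarily how the course implements it:

```python
def extract_relations(messages):
    """Compare each adjacent pair of messages and record the first
    differing letters as an ordering rule (before, after).

    Returns None when the order is invalid, i.e., a message is
    followed by its own prefix (the retransmission case)."""
    relations = []
    for first, second in zip(messages, messages[1:]):
        for a, b in zip(first, second):
            if a != b:
                relations.append((a, b))  # a comes before b
                break
        else:
            # No difference within the overlap: invalid if the longer
            # message comes first (e.g. "educated" before "educate").
            if len(first) > len(second):
                return None
    return relations
```

For example, `extract_relations(["decode", "interview"])` yields `[("d", "i")]`, and `extract_relations(["dirt", "dorm"])` yields `[("i", "o")]`.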
Part-2: Representing the dependencies#
We now have a set of relations mentioning the relative order of the pairs of letters:
Now the question arises, how can we put these relations together? One might be tempted to start chaining all these together. Let’s look at a few possible chains:
As we can observe from the chains above, some letters might appear in more than one chain, so putting the chains into the output list one after the other won’t work. Some of the letters would be duplicated, resulting in an invalid ordering.
Let’s try to visualize the relations better with the help of a graph. The nodes are the letters, and an edge from a letter x to a letter y represents that x comes before y according to the encrypted messages.
Part-3: Generating the dictionary#
As we can see from the graph, four of the letters have no incoming arrows. What this means is that there are no letters that have to come before any of these four.
Remember that there could be multiple valid dictionaries, and if there are, then it’s fine for us to return any of them.
Therefore, a valid start to the ordering we return would be:
["o", "m", "u", "z"]
We can now remove these letters and edges from the graph because any other letters that required them first will now have this requirement satisfied.
On this new graph, there are now three new letters that have no in-arrows. We can add these to our output list.
["o", "m", "u", "z", "x", "q", "w"]
Again, we can remove these from the graph.
Then add the two new letters with no in-arrows.
["o", "m", "u", "z", "x", "q", "w", "v", "s"]
This leaves the following graph:
We can place the final two letters in our output list and return the ordering:
["o", "m", "u", "z", "x", "q", "w", "v", "s", "a", "r"]
Let’s now review how we can implement this approach next.
Algorithm#
Identifying the dependencies and representing them in the form of a graph is pretty straightforward. We extract the relations and insert them into an adjacency list:
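As a sketch of this step, assuming the adjacent-pair comparison from Part 1 (the names are illustrative):

```python
def build_graph(messages):
    """Build a directed graph as an adjacency list: each unique letter
    maps to the set of letters known to come after it."""
    # Every unique letter becomes a node, even if it has no edges.
    adjacency = {ch: set() for message in messages for ch in message}
    for first, second in zip(messages, messages[1:]):
        # The first differing letter of an adjacent pair gives one edge.
        for a, b in zip(first, second):
            if a != b:
                adjacency[a].add(b)
                break
    return adjacency
```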
Next, we need to generate the dictionary from the extracted relations. The first step is to identify the letters (nodes) with no incoming links. Determining whether a particular letter (node) has any incoming links is not straightforward with our adjacency list format. A naive approach would be to repeatedly iterate over the adjacency lists of all the other nodes and check whether or not they contain a link to that particular node.
This naive method would be fine for our case, but perhaps we can do it more optimally.
An alternative is to keep two adjacency lists:
- One with the same contents as the one above
- One reversed that shows the incoming links
This way, every time we traverse an edge, we can remove the corresponding edge from the reversed adjacency list:
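A sketch of the two-list idea (illustrative names; the reversed list directly answers which letters must come before a given one):

```python
def build_graphs(messages):
    """Build a forward adjacency list (outgoing links) and a reversed
    one (incoming links) over the same set of letter nodes."""
    forward = {ch: set() for message in messages for ch in message}
    reverse = {ch: set() for message in messages for ch in message}
    for first, second in zip(messages, messages[1:]):
        for a, b in zip(first, second):
            if a != b:
                forward[a].add(b)   # a -> b
                reverse[b].add(a)   # b has an incoming link from a
                break
    return forward, reverse
```

A letter whose reversed set is empty has no incoming links and can be placed in the output; traversing an edge `a -> b` then corresponds to discarding `a` from `reverse[b]`.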
What if we told you that we could do better than this? Instead of tracking the full set of incoming links for each letter, we can track just the count of incoming edges. We can keep the indegree count of every letter alongside the forward adjacency list.
Indegree corresponds to the number of incoming edges of a node.
Let’s see how that might look below:
Now, we can decrement the indegree count of a node instead of removing it from the reverse adjacency list. When the indegree of the node reaches 0, this represents that this particular node has no incoming links left.
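A sketch of how the indegree counts might be kept alongside the forward adjacency list (names are illustrative):

```python
def build_graph_with_indegree(messages):
    """Keep a per-letter count of incoming edges instead of a full
    reversed adjacency list; a count of zero means no incoming links."""
    adjacency = {ch: set() for message in messages for ch in message}
    indegree = {ch: 0 for ch in adjacency}
    for first, second in zip(messages, messages[1:]):
        for a, b in zip(first, second):
            if a != b:
                if b not in adjacency[a]:  # don't count a repeated rule twice
                    adjacency[a].add(b)
                    indegree[b] += 1
                break
    return adjacency, indegree
```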
We perform BFS on all the letters that are reachable, i.e., the letters whose indegree count is zero. A letter only becomes reachable once the letters that need to come before it have been added to the output, result.
We use a queue to keep track of reachable nodes and perform BFS on them. Initially, we put the letters that have zero indegree count. We keep adding the letters to the queue as their indegree counts become zero.
We continue this until the queue is empty. Next, we check whether all the letters in the messages have been added to the output or not. If some letters are missing, they must still have incoming edges left, which means there is a cycle. In this case, we return "".
Remember that there can be letters that do not have any incoming edges. This can result in different orderings for the same set of messages, and that’s alright.
Let’s try to visualize the algorithm with the help of a set of slides below:
Let’s review the actual implementation code below:
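Putting the three parts together, a complete version might look like the sketch below (`find_dictionary` is an assumed name; it returns an empty string for an invalid ordering, matching the empty-dictionary case described earlier):

```python
from collections import deque

def find_dictionary(messages):
    """Recover a letter ordering consistent with the training messages,
    or return "" if the order is invalid (the retransmission case)."""
    # Parts 1 & 2: build the adjacency list and the indegree counts.
    adjacency = {ch: set() for message in messages for ch in message}
    indegree = {ch: 0 for ch in adjacency}
    for first, second in zip(messages, messages[1:]):
        for a, b in zip(first, second):
            if a != b:
                if b not in adjacency[a]:  # don't count a repeated rule twice
                    adjacency[a].add(b)
                    indegree[b] += 1
                break
        else:
            # A message followed by its own prefix is never valid.
            if len(first) > len(second):
                return ""

    # Part 3: BFS over the letters whose indegree is zero.
    queue = deque(ch for ch in indegree if indegree[ch] == 0)
    result = []
    while queue:
        ch = queue.popleft()
        result.append(ch)
        for nxt in adjacency[ch]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    # If some letters still have incoming edges, the graph has a cycle.
    if len(result) < len(indegree):
        return ""
    return "".join(result)
```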
Complexity measures#
| Time Complexity | Space Complexity |
|---|---|
| O(L) | O(1) |

Let N be the total number of messages in the input list.
Let L be the total length of all the messages in the input list, added together.
Let U be the total number of unique letters in the messages. While this is limited to 26 in our case, we’ll still look at how it would impact the complexity if this were not the case.
Time complexity#
There are three parts to the algorithm:
- identifying all the relations
- putting them into an adjacency list
- converting it into a valid alphabet ordering
In the worst case, the identification and initialization parts require checking every letter of every word, which is O(L).
For the generation part, we can recall that a breadth-first search has a cost of O(V + E), where V is the number of vertices and E is the number of edges. Our algorithm has the same cost as BFS, as it too visits each edge and node once.
A node is visited once all of its edges are visited, unlike the traditional BFS where it is visited once any edge is visited.
Therefore, determining the cost of our algorithm requires determining how many nodes and edges there are in the graph.
Nodes: We know that there is one vertex for each unique letter, i.e., U vertices.
Edges: We generate each edge in the graph by comparing two adjacent messages in the input list. There are N − 1 pairs of adjacent words, and only one edge can be generated from each pair. We can again look back at the English dictionary analogy to make sense of this:
"dirt"
"dorm"
The only conclusion we can draw is that i is before o. This is the reason dirt appears before dorm in an English dictionary. As explained in the solution, the remaining letters rt and rm are irrelevant for determining the alphabetical ordering.
Remember that we only generate rules for adjacent messages and do not add the “implied” rules to the adjacency list.
So with this, we know that there are at most N − 1 edges.
We can place one additional upper limit on the number of edges, as it is impossible to have more than one edge between each pair of nodes. With U nodes, this means there can’t be more than U² edges.
As the number of edges has to be lower than both N − 1 and U², we know it is at most the smaller of these two values: min(U², N).
We can now substitute the number of nodes and the number of edges into our breadth-first search cost of O(V + E).
This gives us: O(U + min(U², N)).
Finally, we combine the two parts: O(L) for the first two parts and O(U + min(U², N)) for the third part. As we have two independent parts, we can add them and look at the final formula to see whether or not we can identify any relation between them. Combining them, we get: O(L + U + min(U², N)).
So, what do we know about the relative values of N, L, and U? We can deduce that both N, the total number of messages, and U, the total number of unique letters, are smaller than the total number of letters, L, as each message contains at least one character and there can’t be more unique characters than there are characters.
Now, we know that L is the biggest of the three, but we do not know the relation between U² and N.
Let’s simplify our formulation a little, as we know that the U term is insignificant compared to the L: O(L + min(U², N)).
Let’s now consider two cases to simplify it a little further:
- If U² is smaller than N, then min(U², N) = U². In this case, U² is smaller than N, which is, in turn, smaller than L, and so U² is definitely less than L. This leaves us with O(L).
- If U² is larger than N, then min(U², N) = N. Because N < L, we’re left with O(L).

So in all cases, we know that min(U², N) is smaller than L. This gives us a final time complexity of O(L).
Space complexity#
The space complexity is O(1) or, for an arbitrarily large alphabet, O(U + min(U², N)).
The adjacency list uses O(E) memory, which in the worst case is O(min(U², N)), as explained in the time complexity analysis.
So in total, the adjacency list takes O(U + min(U², N)) space. Hence, the space complexity for a large number of letters is O(U + min(U², N)).
But for our use case, where U is fixed at a maximum of 26, the space complexity is O(1). This is because U is fixed at 26 and the number of relations is bounded by 26² = 676, so both terms are constants.